HTTX
HTML to TEXT converter
Created by Gabriele Favrin
(E-Mail: favrin@tin.it - FidoNet: 2:333/726.8)
Version 1.5 - May 1997
Index:
1. Utilization terms
2. Properties of HTTX and distribution terms
3. Introduction
4. Hardware requirements and installation
5. How to use
5.1 Command line parameters
5.2 External configuration
6. Error Messages and AmigaDOS Return Codes
7. FAQ (advice, interfacing with other programs and more)
8. Technical information
9. How to contact the Author
10. Greetings
11. Program history
12. Future versions
-----------------------------------------------------------------------------
*** CHILDWARE ***
This software is "CHILDWARE". The author explicitly asks whoever uses this
program to make a donation, in any form or way, to a charitable organization
that works to help children.
If you don't know any, ask at your local post office how to make a payment
to UNICEF.
The amount is up to you, but please do it!
-----------------------------------------------------------------------------
1. Utilization terms
---------------------
Before launching this program on your computer, please read the following
paragraph carefully, and continue only if you agree with the terms written
below.
THE HTTX AUTHOR IS IN NO WAY RESPONSIBLE FOR MORAL AND/OR MATERIAL DAMAGES
THAT HIS PROGRAM MAY CAUSE TO PEOPLE OR THINGS. THE PROGRAMMER HAS DONE HIS
BEST TO LIMIT THE PROBLEMS THAT HTTX MAY CAUSE, BUT HE CANNOT GUARANTEE THAT
IT WORKS CORRECTLY IN ALL SITUATIONS. BY USING HTTX, YOU, THE USER, ACCEPT
ALL MORAL, MATERIAL, CIVIL AND PENAL RESPONSIBILITY.
WARNING:
Many HTML documents are copyrighted and are not freely distributable, not
even when converted to plain text format. The author declines all
responsibility for the use of the files generated with HTTX.
All of the programs mentioned in this document are the property of their
respective owners.
2. Properties of HTTX and distribution terms
---------------------------------------------
The executable program, the source code and the ideas behind it are the
EXCLUSIVE PROPERTY of Gabriele Favrin. All rights reserved.
HTTX is freeware, NOT public domain. It may be distributed only if the
executable file and the documentation remain unchanged. The files may be
distributed in archive formats other than LhA, but compressing the individual
files with PowerPacker or similar tools is not allowed.
HTTX or parts of it may be included on magazine cover disks only with the
authorization of the author.
Commercial use of this package is permitted exclusively to AmiTrix and
Yvon Rozijn (AWeb).
The Aminet, Fred Fish, Meeting Pearls and Amy Resource staff are authorized to
include HTTX in their public domain software collections.
3. Introduction
----------------
HTTX (HTml > TXt) is a program that converts files from HTML, the format used
for documents on the World Wide Web, to pure ASCII. Similar products exist,
but since none of them completely satisfied my needs, I started to write one
myself.
I don't claim this is the best or the fastest one, but it surely has some
functions not found in similar Amiga programs until now.
4. Hardware requirements and installation
------------------------------------------
Required system:
Amiga, 512K, Kickstart 2.04 (37.175) or above.
Required memory:
The size of the file to convert, plus about 15K for buffers and other data.
Install:
Copy HTTX to your C: directory or to a directory in your current path.
5. How to use
--------------
HTTX MUST be launched from the Shell only.
Command syntax:
HTTX InputFile [OutputFile] [options]
Parameters in square brackets are optional. The only required argument is
a valid HTML file ("InputFile").
If no OutputFile is specified, the output name defaults to InputFile.txt
(e.g. "test.html" will be saved as "test.html.txt"). If only a directory path
is given as OutputFile, the file will be saved in that directory under its
default name.
Examples:
-> HTTX data:txt/html/aboxe.html
The file "aboxe.html" will be saved in the directory "data:txt/html/" as
"aboxe.html.txt"
-> HTTX data:txt/html/aboxe.html ram:aboxe.txt
The file "aboxe.html" will be converted and saved to the RAM disk as
"aboxe.txt"
-> HTTX data:txt/html/aboxe.html data:txt/
The file "aboxe.html" will be converted and saved in the directory
"data:txt/" as "aboxe.html.txt"
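The naming rules shown in the three examples above can be sketched in a few lines of Python. This is only an illustration, not HTTX's actual code: the function name resolve_output is hypothetical, and treating a trailing "/" or ":" as the sign of a directory is an assumption (the real program may well ask AmigaDOS whether the path is a directory).

```python
def resolve_output(input_file, output_file=None):
    """Return the path the converted text would be written to."""
    # File part of the input path: everything after the last "/" or ":".
    cut = max(input_file.rfind("/"), input_file.rfind(":")) + 1
    default_name = input_file[cut:] + ".txt"

    if output_file is None:
        # No OutputFile given: save next to the input as InputFile.txt.
        return input_file + ".txt"
    if output_file.endswith(("/", ":")):
        # Assumption: a trailing "/" or ":" marks a directory or volume.
        return output_file + default_name
    # A full file name was given: use it unchanged.
    return output_file
```

With the paths used in the examples, resolve_output("data:txt/html/aboxe.html", "data:txt/") yields "data:txt/aboxe.html.txt".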
5.1 Command line parameters
---------------------------
HTTX offers many options to control the generated document.
LEN
Maximum length for every line in the converted file.
Default: 77 - Minimum: 15 - Maximum: 255
INDENT or IN
Number of spaces used to indent (shift to the right) the <UL> and <DL>
lists. The specified value must allow at least two levels of indentation
within the line length specified with the LEN option.
Default: 3 - Minimum: 1 - Maximum: (LEN value - 10) / 3
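As a rough illustration of how the LEN and INDENT limits fit together, the following Python sketch checks a pair of values against the documented ranges. The function names are hypothetical, and rounding the maximum down to a whole number is an assumption; the document only gives the formula (LEN value - 10) / 3.

```python
LEN_MIN, LEN_MAX, LEN_DEFAULT = 15, 255, 77
INDENT_MIN, INDENT_DEFAULT = 1, 3

def max_indent(line_len):
    # Documented formula: (LEN value - 10) / 3.
    # Assumption: the result is rounded down to an integer.
    return (line_len - 10) // 3

def check_layout(line_len=LEN_DEFAULT, indent=INDENT_DEFAULT):
    """Return (line_len, indent) if both are inside the documented ranges."""
    if not LEN_MIN <= line_len <= LEN_MAX:
        raise ValueError(f"LEN must be within {LEN_MIN} and {LEN_MAX}")
    if not INDENT_MIN <= indent <= max_indent(line_len):
        raise ValueError(
            f"INDENT must be within {INDENT_MIN} and {max_indent(line_len)}"
        )
    return line_len, indent
```

Under these assumptions the default LEN of 77 allows an INDENT of up to 22, while the minimum LEN of 15 allows only an INDENT of 1.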
ANSI or AN
Converts HTML styles and links (HREF and NAME) to ANSI codes.
Do not use it if the converted text will be posted to message areas such as
FidoNet or Usenet newsgroups.
Default: OFF (styles are not converted).
7BIT
Converts HTML entities (accented letters, symbols, and so on) to ASCII
codes lower than 128. This is essential for text forwarded to networks
like FidoNet, where the allowed character codes must range between 32
and 127.
IMPORTANT: the ANSI option adds Escape codes (ASCII 27), which are
forbidden on FidoNet and strongly discouraged when the converted text is
broadcast rather than kept for personal use.
Default: OFF (8 bit chars are not converted).
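To give an idea of what a 7 bit conversion involves, here is a minimal Python sketch. It is not HTTX's conversion table: it simply drops accents using Unicode decomposition, the SYMBOLS replacements and the "?" placeholder are hypothetical, and HTTX applies its own rules (the program history, for instance, shows a final "è" rendered as "e`").

```python
import unicodedata

# Hypothetical replacements for symbols that have no base letter.
SYMBOLS = {"\u00a9": "(c)", "\u00ae": "(R)"}

def to_7bit(text):
    """Reduce text to character codes 32..127 (plus line feeds)."""
    out = []
    for ch in text:
        if ch in SYMBOLS:
            out.append(SYMBOLS[ch])
            continue
        # NFD splits an accented letter into base letter + combining marks.
        for part in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(part):
                continue  # drop the accent itself
            if 32 <= ord(part) <= 127 or part == "\n":
                out.append(part)
            else:
                out.append("?")  # placeholder for anything unmapped
    return "".join(out)
```

For example, to_7bit("Belphégor") gives "Belphegor".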
HRMODE or HR
HTML documents often have the <HR> TAG, which defines a separating line
between paragraphs. HTTX can handle these lines in several ways:
HRMODE=0
No lines are drawn.
This was the NOHR option in previous versions of HTTX.
HRMODE=1
Lines are drawn using the minus "-" character.
HRMODE=2
Lines are drawn using underlined spaces (in ANSI). This mode produces a
better-looking line, but it introduces ANSI codes, which must absolutely be
avoided if the text will be posted to FidoNet or Usenet newsgroups.
Default: HRMODE=1 (lines are inserted using the minus "-" character).
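The three modes can be pictured with a short Python sketch. It is an illustration only: render_hr is a hypothetical name, drawing the line across the full LEN width is an assumption, and the escape sequences are the standard ANSI ones for "underline on" (ESC[4m) and "reset" (ESC[0m), which may differ from the exact codes HTTX emits.

```python
ESC = "\x1b"  # ASCII 27, the Escape code mentioned under the 7BIT option

def render_hr(mode=1, width=77):
    """Return the text drawn for one <HR> separator."""
    if mode == 0:
        return ""                 # no line drawn
    if mode == 1:
        return "-" * width        # plain minus characters
    if mode == 2:
        # Underlined spaces: underline on, spaces, then reset attributes.
        return f"{ESC}[4m{' ' * width}{ESC}[0m"
    raise ValueError("HRMODE value must be 0, 1 or 2")
```

Only mode 2 puts Escape codes into the output, which is why it is unsuitable for FidoNet or Usenet.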
NOALIGN or NA
HTTX supports right and center alignment of text and separators (<HR>).
Examples:
                                centered text
                                                           right-aligned text
If the NOALIGN option is ON, both lines above will start at the left margin,
which saves characters.
Default: OFF (alignment is converted).
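Alignment in plain text comes down to padding with leading spaces, which is also why NOALIGN saves characters. A minimal Python sketch, assuming the text is padded on the left only and aligned within the LEN width (the function name align is hypothetical):

```python
def align(text, how="left", width=77, noalign=False):
    """Pad text with leading spaces so it appears centered or right-aligned."""
    if noalign or how == "left" or len(text) >= width:
        return text               # nothing to add
    free = width - len(text)
    if how == "center":
        return " " * (free // 2) + text
    if how == "right":
        return " " * free + text
    raise ValueError("how must be 'left', 'center' or 'right'")
```

With the default width, a short right-aligned line such as "right-aligned text" costs 59 extra space characters.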
FILENOTE or FN
Saves the document title (<TITLE>) as the output file comment. Only the
first 64 characters are saved, as per the HTML standard.
Default: OFF (title is not saved as comment, but inside the file).
SITE
Adds the specified URL to the output file.
Useful to remember which document the file comes from.
Example:
HTTX ram:children.html SITE=http://www.unicef.org
will start the file with "URL : http://www.unicef.org"
Note: this option takes priority over GETNOTE, so a site specified here
will be used even if GETNOTE is active.
Default: OFF (without this option the URL will not be added).
GETNOTE or GN
Uses the InputFile comment as the URL. This option is an alternative to
SITE, and it is useful with files saved by the AWeb browser and by any other
browser that saves the URL in the comment. AmigaDOS comment length limit is
80 chars.
Default: OFF.
NOHEADER or NOHEAD
Does not insert the title (<TITLE>) and the URL in the converted file. This
option automatically turns off the SITE/GETNOTE options, if present.
Default: OFF (title and URL may be added).
HREF or LINK
Adds the addresses of the links (<A HREF> TAGS) to the output file.
Very useful if the document contains links you want to keep.
Default: OFF (links not added).
IMG
Adds the ALT-text of images (<IMG>) to the output file.
Useful if the document contains images with descriptions.
Default: OFF (ALT-text not added).
BADHTML or BAD
Partial support for documents that do not follow the HTML standard. Use this
option only if parts of the text are missing from the converted page. Using
this option with correct HTML documents may cause unpredictable results in
the converted document.
Default: OFF (HTTX uses standard DTD rules).
FORCE
Forces conversion of the input file without checking whether it is an HTML
document.
USE IT AT YOUR OWN RISK: conversion of text or binary files may cause
unpredictable results.
Normally, HTTX considers a file to be valid HTML when at least one of the
following is true:
- it has an .html or .htm extension
- its starting TAG is <HTML>
- its starting TAG is <!DOCTYPE ...>
Specify this option when none of the three conditions above holds but the
file is nevertheless an HTML document.
Default: OFF (automatic check of file).
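The automatic check described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not HTTX's own test: looks_like_html is a hypothetical name, and ignoring letter case and leading blanks is an assumption.

```python
def looks_like_html(filename, head, force=False):
    """Decide whether a file should be accepted for conversion.

    filename is the input file name; head is the beginning of its contents.
    """
    if force:
        return True  # FORCE skips the check entirely
    # Condition 1: the name ends in .html or .htm.
    if filename.lower().endswith((".html", ".htm")):
        return True
    # Conditions 2 and 3: the first TAG is <HTML> or <!DOCTYPE ...>.
    # Assumption: case and leading white space are ignored.
    start = head.lstrip().lower()
    return start.startswith("<html") or start.startswith("<!doctype")
```

A file that fails all three tests would be rejected with "Object is not of the required type" unless FORCE is given.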
STDIO
Shows the converted file on the screen without saving it to disk. This
option automatically enables the QUIET option.
Default: OFF (converted file is saved to disk).
PRINT
Prints the document instead of displaying or saving it.
printer.device will convert standard ANSI codes and end-of-lines to the
ones used by your specific printer, as set with the "Printer" program
located in the "Prefs" drawer.
This option is required if you want to print the converted document,
especially if the ANSI option is enabled, because the ANSI codes used for
conversion are different from the printer ones, which are more generic.
Previous versions of HTTX relied on a command like "HTTX aboxe.html prt:",
which should now be avoided.
This option automatically enables the QUIET option, and turns off the
FILENOTE and STDIO options.
Default: OFF (document is displayed on screen or saved to a file).
APPEND
Normally HTTX overwrites an existing output file.
If APPEND is ON, the converted text will be added to the end of the
specified file.
Default: OFF (overwrite output file, if already exists).
NOCFG
HTTX loads a default configuration from ENV:httx.prefs (unless another one
is specified with the CFG option). If NOCFG is ON, HTTX skips the
configuration file and uses only the default option values and the
parameters specified on the command line.
For information on external configuration, see section 5.2.
Default: OFF (HTTX searches for the configuration).
CFG
This option specifies the name of the configuration file used by HTTX.
This file MUST be present in the ENV: directory.
This option turns NOCFG OFF.
For information on external configuration, see section 5.2.
Default: OFF (HTTX loads the httx.prefs configuration file).
QUIET
Does not display any HTTX message. This option is useful when the program
is used within a script.
WARNING: if active, this option also hides error messages. The AmigaDOS
error codes are still returned, however.
Default: OFF (HTTX output is displayed).
For any option not specified, HTTX uses the default setting.
When conversion is finished, if the QUIET option is OFF, HTTX will show:
- the size of the input and output files.
- the presence of 8 BIT chars, and their conversion if active.
- TABs or ASCII chars less than 32 that were not converted because they
are included in pre-formatted text.
- non-standard HTML comments, which can hide parts of the document.
If the converted file appears incomplete, try using the BADHTML option.
5.2 External configuration
--------------------------
HTTX supports an external configuration: a text file holding your most used
options, so that they need not be typed every time you use the program.
By default (unless the NOCFG option is set, or the CFG option names a
different file) HTTX looks for the "httx.prefs" file in the ENV: directory.
Multiple configurations are possible, for example one for file conversion
and another for printing: create different configuration files and select
one by giving its name to the CFG option (there is no need to specify ENV:).
Example:
-> HTTX aboxe.html
Converts "aboxe.html" using the default configuration (httx.prefs).
-> HTTX aboxe.html PRINT CFG=httxprt.prefs
Converts "aboxe.html" using the configuration file called httxprt.prefs
located in the ENV: directory.
Allowed parameters
------------------
External configuration supports a subset of available command line options.
Each option MUST be specified in its extended form (for example ANSI, not AN,
INDENT instead of IN, and so on).
The file must contain just the options and their parameters, if any. Options
may be written one per line for better readability.
Available options (for description see section 5.1) are the following:
LEN - maximum length of lines.
INDENT - size of indentation.
ANSI - allow use of ANSI codes in conversion.
7BIT - conversion of HTML entities with ASCII code >127 to 7 bit chars.
HRMODE - separator line drawing mode.
NOALIGN - turn off center or right alignment.
FILENOTE - save HTML document title as file comment.
GETNOTE - use of original file comment as URL in destination file.
NOHEADER - don't add URL and title of original HTML document.
HREF - add links (<A HREF>) to destination file.
IMG - add ALT-Text of images to destination file.
BADHTML - partial support for badly written HTML.
Parameters specified on the command line override those specified in the
configuration file. A switch that is already set in the configuration file
is toggled when it is given again on the command line. Examples:
If a configuration file has the following line:
IMG GETNOTE LEN=70
and on the command line you write:
-> HTTX aboxe.html IMG
the result is that IMG, turned on by the configuration file, is turned off
again because it is also present on the command line.
-> HTTX aboxe.html LEN=74
LEN is present both in the configuration file and on the command line, and
the command line value overrides the configured one. LEN is now set to 74.
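The toggle and override behaviour of the two examples can be modelled in Python as follows. This is a sketch under assumptions, not HTTX's parser: the apply_options name is hypothetical, only the options allowed in a configuration file are handled (in their extended form), and values are accepted only in the NAME=value form.

```python
SWITCHES = {"ANSI", "7BIT", "NOALIGN", "FILENOTE", "GETNOTE",
            "NOHEADER", "HREF", "IMG", "BADHTML"}
VALUED = {"LEN", "INDENT", "HRMODE"}

def apply_options(words, settings=None):
    """Apply a line of options on top of existing settings."""
    result = dict(settings or {})
    for word in words.split():
        name, sep, value = word.partition("=")
        name = name.upper()
        if name in VALUED and sep:
            result[name] = int(value)                     # later value overrides
        elif name in SWITCHES and not sep:
            result[name] = not result.get(name, False)    # switch is toggled
        else:
            raise ValueError(f"unknown or malformed option: {word}")
    return result

# Configuration file first, then the command line on top of it.
config = apply_options("IMG GETNOTE LEN=70")
```

Starting from that configuration, apply_options("IMG", config) leaves IMG off and GETNOTE on, and apply_options("LEN=74", config) sets LEN to 74.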
How to create an external configuration
---------------------------------------
External configuration files are in effect system variables, managed with
SetEnv and GetEnv.
System variables are located in the ENVARC: directory (on disk) and in ENV:
(generally in RAM). So the contents of ENV: are valid only for the current
session, while the contents of ENVARC: remain valid after a reset.
To permanently save a configuration file, copy it to both ENV: and ENVARC:.
HTTX configuration may be fully managed using the plugin for the AWeb WWW
browser.
From shell:
- creation of configuration file
SetEnv httx.prefs "options"
httx.prefs is the default filename. You can use any name you want.
Options must be enclosed in quotation marks to be compatible with some
shells (like csh).
The configuration file must be saved to ENV: (for use in the current session)
and to ENVARC: (for future use, after the machine has been turned off).
- configuration editing
Use your favourite text editor (Ed, Cygnus Editor, GoldEd, and so on).
Remember to save the file to ENVARC: if you want to permanently store
changes.
6. Error Messages and AmigaDOS Return Codes
--------------------------------------------
When execution terminates, HTTX returns the appropriate AmigaDOS Return Code
(RC), which scripts can test to determine whether the conversion was
successful. See your AmigaDOS handbook for a complete list of error codes.
In case of error, if QUIET is off, the appropriate AmigaDOS message will be
displayed.
The most common errors are listed below. If the system is localized,
messages are displayed in the appropriate language. See your AmigaDOS manual
for further information.
Argument line invalid or too long
The arguments were entered incorrectly.
*** Break
The user pressed Control-C, interrupting the conversion, and the output
file has been removed.
Not enough memory available
Not enough memory is available to allocate the buffers used by HTTX.
Object not found
The specified file doesn't exist or is inaccessible.
Object is not of the required type
The input file does not seem to be an HTML document.
Try using the FORCE option.
HTTX can show other errors, due to wrong use of commands or options:
The line length must be within 15 and 255 characters (current is NN)
The line size specified with LEN parameter is a number less than 15 or
more than 255.
Indentation size must be at least 1 (current is NN)
The indentation size must be at least one character wide.
With line length XX, indentation size YY, max indent level is ZZ.
You must allow at least 3 indentations.
The maximum indentation level, with this line size and indent value, is
less than 3.
HRMODE value must be 0, 1 or 2
Value set for HRMODE is not valid. It must be 0, 1 or 2.
Finally, there are a few warnings which may be displayed. The conversion
will be executed, but there can be situations altering the final result:
Error in env config. HTTX will use its defaults
External configuration has errors. HTTX will use the default settings and
the parameters passed to the command line.
ENV config 'NAME' not found.
The configuration specified with CFG option was not found.
HTTX will use the default settings and the parameters passed to command
line.
Found non-ASCII chars in preformatted text!
In a preformatted text section HTTX found some 8 BIT characters.
Do not ignore this warning if the converted text has to be posted in
FidoNet conferences or Usenet newsgroups.
This file contains non standard HTML comment(s)!
The file may not have been completely converted, since non standard HTML
comments were found. If this is the case, try using the BADHTML option.
7. FAQ (advice, interfacing with other programs and more)
----------------------------------------------------------
Q. "ANSI styles (bold, italic, underline, blue) stop after first line."
A. According to the ANSI standard, the end of a line does not stop a style.
So the problem is in the text viewer program, not in HTTX. The standard
shell and Multiview, for example, work correctly.
Q. "Converted text from HTTX appears centered, but in the original document
it isn't."
A. This can happen with badly written pages, where alignment TAGS are
followed by tables (which are left aligned by default). To solve the problem
use the NOALIGN option.
Q. "Sometimes alignment doesn't work, wordwrap and lists are not correctly
formatted".
A. This happens with text enclosed between <PRE> or <LISTING> TAGS. HTTX
copies that text as is, without formatting. This choice was made because
such text often contains source code that the author wishes to keep as is.
Q. "Some pages are not correctly converted..."
A. There may be many reasons: a layout based on tables (currently not fully
supported, see section 8), errors in the HTML source (HTTX is quite
tolerant, but there are limits) or errors in the HTTX engine. If you think
the page is correct, send me an E-mail with its URL.
(E-Mail: favrin@tin.it, FidoNet: 2:333/726.8)
Q. "How can I directly use HTTX from AWeb or Directory Opus?"
A. With AWeb, users of version 3.0 or later can use the enclosed ARexx
plugin.
HTTX can be used from Directory Opus by creating a button configured as
follows (Directory Opus 4.12):
New Entry/AmigaDOS:
C:HTTX {f} {d}
(replace C: with path where HTTX is)
With this configuration, the file selected in the "source" directory will be
converted to text and saved to the "destination" directory.
With the "Do all files" flag activated, more than one file can be converted
by selecting the files and clicking the HTTX button.
Q. "How can I improve the performance of HTTX?"
A. To speed up the conversion, try using a filesystem with blocks of 1024
bytes, like the RAM disk. If memory is almost full or fragmented, however,
saving to the RAM disk will slow down the conversion process.
8. Technical information
-------------------------
This section discusses some aspects of HTML and their implementation in
HTTX. Reading it is not required to learn how to use HTTX.
What is supported:
- Entities described in RFC 1866, © and ® (NHTML).
- Separators (<DIV>, <BR>, <P>, <HR>) and font size changes (<H1>...<H6>).
- Alignment (center and right) of text (headers and paragraphs) and
separators.
- Physical and logical styles.
- Numbered (<OL> with possible START attribute), unnumbered (<UL>) and
definition (<DL>) lists up to a maximum of 255 levels.
- Document title (<TITLE>).
- Links (<A HREF>), user maps and inline images (<IMG> with optional
ALT-text).
- Pre-formatted text (<PRE> and <LISTING>).
- Non standard use of "<" and ">" in a preformatted text (it may be
changed).
What is not supported:
- Tables (<TABLE>). However, the readability of text in tables has been
improved in this version.
These TAGS will probably be implemented in future releases of HTTX.
- Conversion of entities into sequences that can be handled by Spot, Mail
Manager, and similar. I have tried, but the number of chars required even
for one BOLD line is too high (every word would have to be preceded and
followed by an "*").
- <SCRIPT>, <APPLET>, <STYLE>: I can't see how I could support them,
so I skip them entirely.
Implementation of the standard:
- Unknown TAGS are ignored.
- Double spaces and the leading and trailing blanks of each line are removed.
- Control chars (ASCII codes below 32) are converted to spaces.
- PC (CR+LF) and MAC (CR) EOLs are converted to Amiga format (LF).
- For better readability of text in tables, <TD> TAGS are converted to a space
and <TR> TAGS are converted to EOL. Between two <TR> TAGS (a table row) at
most one separator (<HR>) is inserted.
- Consecutive EOLs are reduced to one EOL (except for <BR>).
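The white space rules in the list above can be illustrated with a small Python sketch. It works on plain text only and is not HTTX's parser: the normalize name is hypothetical, and the special cases mentioned in this document (<BR>, pre-formatted text, table TAGS) are deliberately left out.

```python
import re

def normalize(text):
    """Apply the EOL, control-char and blank rules to a piece of plain text."""
    # PC (CR+LF) and MAC (CR) line ends become Amiga line ends (LF).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remaining control chars (codes below 32, except LF) become spaces.
    text = "".join(" " if ord(c) < 32 and c != "\n" else c for c in text)
    lines = []
    for line in text.split("\n"):
        # Collapse runs of spaces, then drop leading and trailing blanks.
        lines.append(re.sub(r" {2,}", " ", line).strip(" "))
    # Consecutive EOLs are reduced to one EOL.
    return re.sub(r"\n{2,}", "\n", "\n".join(lines))
```

The order matters: control characters are turned into spaces before runs of spaces are collapsed, so a TAB next to a space still ends up as a single blank.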
9. How to contact the Author
-----------------------------
Besides any communication, problem, bug report, advice or anything else, a
comment about HTTX will be appreciated, as will news of donations made to
organizations that take care of children (see CHILDWARE).
E-Mail : favrin@tin.it
FidoNet: 2:333/726.8
Please write to me in Italian or English, thank you.
HTTX support page:
http://freepage.logicom.it/poing/httx/index.html
(AKA http://www.logicom.it/personal/poing/httx/index.html )
10. Greetings
--------------
Betatesting of version 1.5:
Claudio Mazzuco kirk@maya.dei.unipd.it
William Parker wparker@bill.amitrix.com
Giuseppe Pasanisi peppe@mail5.clio.it
Giuseppe Ammendolia ryuga@freenet.hut.fi
This English documentation (translated from the Italian one):
Fabio Belli zak@anturio.com
Betatesting of AWeb Plugin:
William Parker wparker@bill.amitrix.com
Dale Currie dalec@zorro.amitrix.com
Many thanks to Yvon Rozijn for wanting HTTX inside AWeb II and for this
great year!
Thanks to W. v. Oortmerssen for the splendid AmigaE, used to write HTTX.
Finally, special greetings to everyone who wrote to me about HTTX and to
everyone who uses it!
11. Program history
--------------------
V1.0 (July 1996)
First public release.
V1.1 (November 1996)
+Improved speed.
+Added STDIO, QUIET, BADHTML, NOHEADER, GETNOTE options.
+Added AmigaDOS return codes support.
+Added <ADDRESS> and <LISTING> TAG support.
+Modified <BR> and <LI> management, as required by the HTML 3.2 standard.
+Rewritten entities management: now they will be converted even if they
are not closed with ";".
+Extended 7 bit conversion, now faster, more complete (almost all
characters are converted) and more accurate (no accented letters are left in
words, for example "HTTX è bello" becomes "HTTX e` bello" while
"Belphégor" will become "Belphegor").
+By popular demand, default 7BIT option is now OFF.
+Improved support for multiline comments and badly written HTML.
+Added support for TAGS with LF or "<" and ">".
-Fixed many little cosmetic bugs in conversion (double spaces in some
cases, bad chars when leaving <PRE> and <SCRIPT>, and others).
-Fixed an error that could cause a lock on file to convert.
-Fixed an error in FILENOTE option.
V1.1a (January 1997)
-Removed a stupid and rare bug in TAGS management.
-Fixed support for MAC EOLs.
V1.1b
-Fixed management of some entities.
V1.5 (May 1997)
+Greatly sped up the program, thanks to a complete rewrite of the HTML parser
and optimization of many functions.
+Added options HRMODE, NOALIGN, PRINT, APPEND, NOCFG, CFG.
+Added alignment support (center and right) of text and separators.
+Added support for various TAGS and HTML attributes (like START in numbered
lists).
+Optimized the ANSI output.
+Added support for ANSI separators.
+Added support for external configuration.
+Improved entities support: they are now identified regardless of the
closing character.
+Clearer HTTX internal error messages. DOS errors are now reported through
the AmigaDOS PrintFault() function.
+Added support for "HTTX source_filename destination_path".
+Improved support of separators in tables.
+HTTX now follows the HTML DTD more faithfully for many TAGS.
-Fixed some bugs in wordwrap.
-Fixed all bugs reported in version 1.1b.
+... Many many other improvements!!!
12. Future versions
--------------------
HTTX is continuously evolving: I use it every day, so I keep noticing
shortcomings and possible improvements.